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The efficiency of Monte Cario samplers is dictated not only by energetic effects, such as large 
barriers, but also by entropic effects that are due to the sheer volume that is sampled. The latter 
effects appear in the form of an entropic mismatch or divergence between the direct and reverse trial 
moves. We provide lower and upper bounds for the average acceptance probability in terms of the 
Renyi divergence of order 1/2. We show that the asymptotic finitude of the entropic divergence is 
the necessary and sufficient condition for non-vanishing acceptance probabilities in the limit of large 
dimension. Furthermore, we demonstrate that the upper bound is reasonably tight by showing that 
the exponent is asymptotically exact for systems made up of a large number of independent and 
identically distributed subsystems. For the last statement, we provide an alternative proof that relies 
on the reformulation of the acceptance probability as a large deviation problem. The reformulation 
also leads to a class of low- variance estimators for strongly asymmetric distributions. We show that 
the entropy divergence causes a decay in the average displacements with the number of dimensions 
n that are simultaneously updated. For systems that have a well-defined thermodynamic limit, the 
decay is demonstrated to be n -1 / 2 for random-walk Monte Carlo and n -1 / 6 for Smart Monte Carlo 
(SMC). Numerical simulations of the LJ38 cluster show that SMC is virtually as efficient as the 
Markov chain implementation of the Gibbs sampler, which is normally utilized for Lennard- Jones 
clusters. An application of the entropic inequalities to the parallel tempering method demonstrates 
that the number of replicas increases as the square root of the heat capacity of the system. 
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I. INTRODUCTION 

The Markov Chain Monte Carlo algorithms [l], Q sam- 
ple a configuration or phase space by means of an ergodic 
Markov chain that leaves the equilibrium distribution in- 
variant. Except for a few remarkable exceptions, such 
as Swendsen and Wang's cluster algorithm and follow- 
ups [H, 0, Q , the chain's transition matrix is constructed 
by means of the rejection technique, according to the 
Metropolis-Hastings prescription. Typically, the ensuing 
Markov chains are local in nature. In the limit of small 
displacements (nearest- neighbor moves on a lattice), the 
generated stochastic dynamics becomes an activated dif- 
fusive process, akin to the Smoluchowski dynamics. For 
such dynamics, the rate of equilibration is controlled by 
the mean escape times from the dominant basins, which 
times are proportional to exp(+/3AE), with (3 being the 
inverse temperature and AE being the energy barrier. 
The transition state theory can provide more accurate 
estimates, which involve free energy barriers. 

To attenuate the negative influence of the energetic 
barriers, one may attempt to alter the sampled distri- 
bution by means of carefully-chosen weighting factors, 
as is the purpose of umbrella sampling [7] and multi- 
canonical ensembles @. If such modifications arc not 
what is desired, the mixing times can be improved by 
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attempting larger jumps, ideally, directly through a bar- 
rier. However, examples of truly global algorithms are 
few and their existence depends on specific properties of 
the systems that are investigated. General-purpose algo- 
rithms that achieve non- locality, such as, J- walking [8|, 
simulated tempering Q , or replica exchange 0, flQlfri j , 
do so by means of extending the dimensionality of the 
system, therefore, introducing new parameters that pro- 
vide alternative communication routes. In the enlarged 
space, the algorithms remain local, making the following 
question pertinent to the design of algorithms: What de- 
termines the size of the jumps? In this paper, we provide 
the answer in the form of an entropic mismatch between 
the distribution that is sampled and the trial distribution 
utilized in the construction of the Monte Carlo transition 
matrix. 

For standard random walk Metropolis samplers, nu- 
merical observations evidencing the entropic nature of 
the displacement-limiting factors exist in the form of ex- 
perimental knowledge that updating many particles si- 
multaneously results in a decrease of the average dis- 
placements. Nevertheless, somewhat at odds with the 
observed dimensional sensitivity, the standard justifica- 
tion for the decrease has been that the likelihood that 
the particles collide is increased. Owing to the ensuing 
increase in the energy, high-dimensional moves are more 
likely to be rejected. It was recently realized that this 
explanation is erroneous and that the influence of the di- 
mensionality on the efficiency of samplers is an entropic 
effect [l2|. Numerical experiments have shown that the 
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effect features a behavior similar to that predicted for 
systems made up of independent particles, which parti- 
cles cannot collide. In the original work, the effect has 
not been properly quantified. Given the importance of 
the rejection-based Monte Carlo algorithms in the phys- 
ical sciences, the author feels that a proper analysis is 
desirable. More generally, acquiring tools necessary to 
quantify the effect of dimensionality is of definite impor- 
tance, as Monte Carlo methods are techniques intended 
to cope specifically with large dimensional systems. 

As mathematical technique, the point of start are 
some simple yet strong inequalities. Consider a d- 
dimensional system described by a probability density 
p(x), which we shall cast in the normalized Boltzmann 
form e~@ v ( x ' /Q(/3). The Metropolis-Hastings sampler 
utilizes the trial distribution T(y|x). The average accep- 
tance probability is given by the formula 
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In Section II, we show that 

iexp (-D 1/2 ) < A < cxp (--D 1/2 



(2) 



Here, Dx/2 > is the Renyi entropic divergence of order 
1/2 defined by the expression [l4| 



D 1 



/2 



2 log / dx / dy [p(x)T(y|xV(v)T(x| y ; 



,1/2 



(3) 

The divergence Dx/2 measures the mismatch between 
the direct sampling distribution p(x)T(y|x) and the re- 
verse distribution p(y)T(x\y). It is zero if and only if 
detailed balance is already achieved, an unlikely scenario 
in practical applications. As an extensive entropy, Dx/2 
is additive for independent probability distributions. As 
such, if we sample together a system made up of n in- 
dependent replicas from the product trial distribution 
T(yi|xi) ■ • • T(y„|x„), then the acceptance probability 
A n satisfies the inequality 



A n < exp 



- n -D 
2 



1/2 



(4) 



Such an exponential decrease has been first observed by 
Kennedy and Pendleton [l3[ following an analysis of the 
performance of Hybrid Monte Carlo simulations for har- 
monic oscillators. These authors demonstrated that the 
acceptance probability decreases exponentially unless the 
time step for the molecular dynamics integrator is suit- 
ably decreased as a fractional power of the number of 
degrees of freedom. To wit in view of Eq. ([4]), the choice 
of harmonic oscillators was fortunate, as their Boltzmann 
distribution is factorizable in the normal mode frame. We 
shall rederive Kennedy and Pendleton's results shortly, 
in other contexts. Nevertheless, there is nothing special 
about harmonic oscillators except for the guaranteed ex- 
tensivity of thermodynamic properties, which is, as ad- 
vocated here, the real penalty source. 



We have already presented some of the mathematical 
results so that to entice the reader's attention. The ex- 
ponential decay predicted by Eq. ^ automatically im- 
plies that the entropic divergence Dx/2 cannot be left 
unchanged. It must be decreased at least as fast as 1/n. 
This can be achieved by tuning some parameters. For ex- 
ample, we may decrease the average displacements for the 
random-walk Metropolis algorithm or the ratio between 
consecutive temperatures for parallel tempering. Details 
are left for Section IV. They prove to be consistent with 
prior analysis, wherever such analysis is available. In 
particular, the results for the random walk Metropolis al- 
gorithm and the Smart Monte Carlo method agree with 
those obtained by Roberts and coworkers [H, [lfl 
whereas the results for parallel tempering simulations 
agree with those of Kofke [18[ and Predescu et al [19( . 

A great deal of care is dedicated to the systems made 
up of a large number n of identical and independent sub- 
systems. Indeed, an analysis of the dependence with the 
dimensionality requires some form of homogeneity of the 
system in the thermodynamic limit. In addition, the as- 
sumption of independence is a model for the physical 
reality that sufficiently large subsystems interact weakly. 
More precisely, various properties become extensive in 
the thermodynamic limit because the contribution of the 
frontier interactions to these properties becomes compar- 
atively small. In Section II, we demonstrate that the 
inequality given by Eq. ([4]) is reasonably tight. More 
precisely, if the trial distribution T(y|x) utilized for sam- 
pling a subsystem is left unchanged as the dimensionality 
of the overall system is increased, then 



- lira n 1 log (AO = \d 1/2 . 

n — >oo £ 
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This last identity is the correct statement for the bad 
sampling theorem of Ref. [l2l which incorrectly asserted 
the law 



- lim n 1 log (AO = D x , 

n — >oo 

with D\ being the relative Shannon entropy 



Dx = ~ JdxJ dvp(x)T( y |x) log 



p(yMx| y ) 

P(x)T(y|x) 



(6) 



(7) 



Save for this correction, all predictions made in Ref. [T2 
with respect to the behavior of the average displacements 
as a function of dimensionality remain the same. The rea- 
son is that the asymptotic finitude of the relative Shan- 
non entropy Dx (also called the Kullback-Leibler diver- 
gence or the Renyi divergence of order 1) is a sufficient 
condition for non-vanishing acceptance probabilities. 

In Section III, we recast the evaluation of the average 
acceptance probability as a large deviation problem. We 
then utilize Cramer's large deviation theorem to provide 
a second demonstration of Eq. ([5]). The rare events as- 
sociated with large deviation problems are a source of 
exponentially increasing variances for quantities such as 
acceptance probabilities or ratios of partition functions. 
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To alleviate the overlap problem, a class of low-variance 
estimators for strongly asymmetric distributions is in- 
troduced, along with a demonstration of their statistical 
efficiency. 

In Section V, we find that the Smart Monte Carlo 
method of Rossky, Doll, and Friedman [2(| features an 
improved theoretical scaling of nT 1 ^ . Numerical verifi- 
cation recovers the predicted scaling for correlated sys- 
tems, namely embedded Lennard-Jones clusters. The 
findings are in agreement with the results of Roberts and 
Rosenthal [la ], who have obtained the same n" 1 / 6 scaling 
for systems made up of statistically independent subsys- 
tems. A simulation of the LJ38 cluster shows that Smart 
Monte Carlo is roughly as efficient as the Metropolis al- 
gorithm with single-particle updates, which is the tech- 
nique normally utilized for clusters. For more expensive 
potentials, it is argued that Smart Monte Carlo will likely 
be more efficient provided that the number of atoms up- 
dated simultaneously is in the range of tens to hundreds. 

Section VI contains the conclusions of the paper, which 
are mostly recommendations on the design of Monte 
Carlo algorithms so that to minimize the negative im- 
pact of the entropic effects. 



II. PROOF OF THE MATHEMATICAL CLAIMS 

For the reminder of this paper, the notation 7r(x, y) = 
p(x)T(y|x) turns out to be convenient. Notice that the 
trial distribution 7r(x, y) is normally strongly asymmet- 
ric and that its symmetry is equivalent to the detailed 
balance condition. In addition, we shall also need the 
antisymmetric function 



X(x, y) = log [?r(y, x)/tt(x, y)] , 



(8) 



which we regard as a random variable with respect to the 
probability distribution 7r(x, y). The expected value of 
some random variable Y , that is, the quantity 



J J dxdy7r(x,y)y(x,y), 



(9) 



will be denoted by E(V) or simply EF in order to avoid 
the more cumbersome integral notation. 
The average acceptance probability reads 

A = Emin {1, e x ) = Ee x ' 2 min \eT x l 2 , e x / 2 } . (10) 

Observe the equality 

min{e- x / 2 ,e x / 2 }=e-l x l/ 2 <l. (11) 

It follows that 

A = Ee x / 2 e-l x l/ 2 < Ee x ' 2 = , (12) 

which is the upper-bound inequality in Eq. 



To establish the other inequality, recall that l/x is a 
convex function on x € [0, 00). By virtue of Jensen's 
inequality, we have 



A = Ee x/2 - 



Ee x/2 
A helpful identity is 



X/2 -\X\/2 f Ee X/2 \X\/2 

> Ee x/2 



Ee x / 2 



Ee (x+|x|)/2 = Emax | l5e X| < E ( 1 + e X) 
Combining with Eq. (|13[) . we obtain 



(13) 
(14) 



A> Ee x / 2 



/ Ee (x+\x\)/2 > 1 ( Ee x/ 2 y 



which is the lower-bound inequality in Eq. (|2"|). 

Given their relevance for Monte Carlo algorithms, it is 
worthwhile to recall the definitional features of the Renyi 
divergences. For a > 0, the Renyi divergence of order a 
for continuous distributions p(x) and g(x) is given by the 
expression 



D a (p\\q) 



1 



a- 1 



log 



p(x) Q q(x) 1 -"dx 



(16) 



For a = 1, the Renyi divergence is defined by continuity, 
as the limit a — ► 1. The result, which can be obtained 
by means of l'Hopital's rule, is the Kullback-Leiblcr di- 
vergence (or the relative Shannon entropy) 



Di(p\\q) 



p(x)log[g(x)/p(x)]dx. (17) 



The Renyi divergences are non-negative quantities. They 
are zero if and only if p(x) and g(x) are identical ex- 
cept for a set of measure zero. As expected of entropies, 
the Renyi divergences are additive for independent dis- 
tributions. More precisely, ifp(x, x') = pi(x)p2(x') and 
q(x, x') = qi(x)g 2 (x') then 



D a (p\\q) = D a (pi\\qi) + D a (p 2 \\q 2 ), 



(18) 



a relation we shall utilize in the following section. 

As far as the theory developed in the present paper is 
concerned, of importance is the Renyi divergence of order 
1/2, which provides a measure of the mismatch between 
the direct p(x)T(y|x) and the reverse p(y)T(x|y) sam- 
pling probabilities. In words, the divergence quantifies 
the lack of detailed balance in a way that relates directly 
to the values of the acceptance probabilities, by virtue 
of the inequalities we have established. Notice that the 
divergence of order 1/2 is symmetric under the permu- 
tation of p and q. The common value is bounded from 
above by either of the Kullback-Leibler divergences 

D 1/2 (p\\q) = D 1/2 (q\\ P ) < min{Di(p||?),Di(g||p)}. 

This follows from Jensen's inequality and the convexity 
of the function — log(x). 
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A different lower bound for the acceptance probability 
helps demonstrate the limiting result given by Eq. (J5J), 
for independent systems. Let us denote by E S (Y) the ex- 
pected value of some random variable against the sym- 
metric probability distribution ir s (x, y) defined by the 
expression 



7r(x,y)exp[A(x,y)/2] 
Eexp(A/2) 



= [e Z5l / 2 7r(x,y) 7 r(y,x)] 



1/2 



(20) 

With this notation, the expression for the acceptance 
probability [see the left-hand side of Eq. (fTB")) ] becomes 



A= Ee x / 2 



,e-l x l/ 2 ) =e-^I 



-\X\/2 



(21) 



Since e x is a convex function, Jensen's inequality pro- 
duces 

A> e -i D ^ 2 exp(-E s \X\/2). (22) 
Cauchy's inequality implies E S |X| < (EgA 2 ) 1 / 2 and so, 



A > exp 



1d 1/2 -±(e s x 2 



(23) 



It is worth noting that (E^A 2 ) 1 / 2 is in fact a stan- 
dard deviation. The function A(x, y) is antisymmetric 
and, consequently, its expectation against the symmetric 
measure 7r s (x, y) is zero. To see the relevance of this ob- 
servation, assume that we sample together a large num- 
ber n of identical and statistically independent systems 
described by the same probability distribution p(x). As- 
sume also that we attempt to utilize a same trial distri- 
bution T(y|x), independent of n. The formula for the 
average acceptance probability reads 



A*. 



dxidyi ■ • ■dyL n dy n Tr{yii 1 y 1 ) • 



X7r(x n ,y„)mm< 1, I I — 



(24) 



(25) 



For the last identity, we have employed Eq. (|2"Tj) , with S„ 
defined accordingly by 



5*4x1, yi, . . . ,x„,y„) = log 
The identity 



7r(yi,Xj) 



n 



S„ = £>g 



7r(yi,Xi 



7r(xi,yj 



^2 A(x 4 ,y 2 ) 

i=l 



(26) 



(27) 



shows that S n is a sum of independent and identically 
distributed random variables Xi. By independence, iden- 
tical distribution, and the vanishing expectation of each 
Xi, 



E S S< 



sXiXj 



nEW 2 . 



(28) 



On the other hand, the additivity of the Renyi diver- 
gence of order 1/2 for independent distributions implies 

Ee" 5 " 72 = e - {n/2)Dl / 2 . (29) 

We obtain 

e -(n/2)D 1/2 > ^ > e -(n/2)L> 1/2 -(nE s X 2 ) 1/2 /2 (39) 

and conclude 

\d 1/2 < -ilog(A) < \d 1/2 + ^{E s X^. (31) 

This sequence of inequalities produces Eq. (JSJ) upon let- 
ting n — ► 00. We shall construct another proof for Eq. ([5]) 
in the next section, by means of Cramer's large deviation 
theorem. 



III. THE ACCEPTANCE PROBABILITY AS A 
LARGE DEVIATION PROBLEM 

The basic task of this section is to reformulate the eval- 
uation of the acceptance probability as a large deviation 
problem. By doing so, we obtain a better understand- 
ing of the source of the exponential decrease in the ac- 
ceptance probability for independent systems. We will 
find that the source is the rare-event sampling normally 
associated with large deviation problems. In fact, we 
shall construct a different proof for Eq. §5§ by means of 
Cramer's large deviation theorem [2lj . which quantifies 
the frequency of such rare events. In addition, we are led 
to the consideration of special low- variance estimators for 
the acceptance probability, which can be generalized to 
other scenarios, as done in Appendix I. 

We start with the following formula for the acceptance 
probability 



A = 



dxdy min {?r(x, y), 7r(y, x)} . (32) 



Let /a(x, y) be the indicator function of the set 

A = {(x,y) :7r(x,y) <7r(y,x)}, (33) 

and let Ia c (x, y) be the indicator function of the comple- 
ment of A. It follows that 

A = J J dxdy [7r(x,y)/ A (x,y) + 7r(y, x)Ia<= (x, y)] . 

Let B be the set of points such that 7r(x, y) = 7r(y,x). 
Use of symmetry in Eq. (f33|) produces lA"{x,y) = 
I A (y, x) + I B {x, y). As such, 

A = J J dxdy[7r(x,y)I A (x,y) +n(y,x)I A (y,x) 

+7r(x,y)J B (x,y)] =2 / / dxdyn(x, y)/ A (x, y) (34) 



<2x<fy7r(x,y)J B (x,y). 
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Notice that 7r(x, y) is symmetric if and only if A(x, y) = 
and that 7r(x, y) < 7r(y,x) if and only if X(x, y) > 
0. If follows that the acceptance probability A is the 
probability that A(x, y) = plus twice the probability 
that X(x,y) > 0. That is, 

A = P{X = 0) + 2P{X > 0). (35) 

For a system made up of a large number n of inde- 
pendent and identically distributed subsystems sampled 
together, Eq. (f35|) becomes 

A n = P(S n /n = 0) + 2P(S n /n > 0). (36) 

Recall the definition of S n given by Eq. (|27| . The division 
by n does not change the equalities or inequalities in 
Eq. ([35]) and is a formality that arranges Eq. ([36]) in the 
form typical of large deviation problems. 

By the strong law of large numbers, S n /n converges 
to minus the Kullback-Leibler divergence D\ given by 
Eq. ([7]). If the detailed balance is not satisfied almost 
everywhere, then —D\ < 0. Naturally, this implies 
P(S n /n = 0) and P{S n /n > 0) -> 0, as n -> oo. For 
large deviation problems, the decay of the last probabil- 
ities is exponentially fast. For the simple case discussed 
here, where S n is a sum of identical and statistically in- 
dependent random variables, the law of the exponential 
decay is exactly known and is given by Cramer's theorem, 
the application of which leads again to 

- lim n^logtAO- (37) 

with D1/2 defined by Eq. The details of the proof 
are given in Appendix II. 

Eq. (|34[) provides two different estimators for the ac- 
ceptance probability, the fifty-fifty average of which is 
that implied by Eq. |T]). The first one is given by Eq. (|55|) 
and is only adequate for large values of the acceptance 
probability (larger than 1/2). The second can be deduced 
by a similar reasoning, save for changing the sense of the 
inequality in the definition of the set A. It is given by 

A = P(X = Q) + 2 I I dxdy^(x,y)exp[A(x,y)]. 
J Jx<o 

(38) 

In words, we first test if the move is likely to be rejected. 
If the answer is positive, then we accumulate twice the ra- 
tio p(y)T(x|y)/p(x)T(y|x). We accumulate 1 if detailed 
balance is already satisfied and otherwise. The second 
estimator is much better behaved than the first if the ac- 
ceptance probability is small, which happens when many 
moves are likely to be rejected. In this case, the proba- 
bility for the event X < is large and we get adequate 
statistics. In fact, Eq. ([35]) implies P(X > 0) < 1/2, 
so that P(X < 0) > 1/2, meaning that the number of 
points necessary for the implementation of Eq. ([38]) is al- 
ways more than half the total number. In addition, the 
variance of the estimator is always more favorable, since 
exp[A(x, y)] is smaller than 1 whenever X < 0. In many 



cases, smaller means significantly smaller and the stan- 
dard deviation of the estimator turns out to be roughly 
proportional to the value of the acceptance probability 
itself. 

The strategy presented in the preceding paragraph has 
been utilized before by the present author and collabo- 
rators in the context of replica exchange methods for the 
design of partition function estimators that alleviate the 
overlap problem [24]. In fact, the source of the over- 
lap problem is a poor statistics related to the fact that 
P(X > 0) can be exponentially small. Motivated by 
these examples, we present and justify the general form 
of such estimators in Appendix I. 

The practical relevance of Eq. I|38p is for tuning the 
various parameters controlling a simulation. There, good 
accuracy over few Monte Carlo cycles is required in con- 
ditions in which most attempted moves are rejected. On 
a computer, the equality X = cannot be tested exactly 
in floating-point arithmetic, owing to inherent numerical 
errors. In addition, the estimating function implied by 
Eq. (|38|) is not necessarily smaller than 1 everywhere. As 
described in Appendix I, an estimator that also addresses 
these issues is provided by 

f 0, if A(x,y) >log(2), 

e min{0,X(x,y)} x J ^ jf |X( X ,y)| <log(2), (39) 

{ 2, ifX(x,y)<-log(2). 



IV. SCALING OF TUNING PARAMETERS 
WITH THE DIMENSIONALITY: EXAMPLES 

Eq. @ produces the necessary and sufficient condi- 
tion for the acceptance probabilities to remain finite upon 
large changes in various parameters, such as the dimen- 
sionality of the system. Let D 1 / 2 {d, a) denote the Renyi 
divergence of order 1/2 for a d-dimcnsional sampler uti- 
lizing a trial distribution additionally characterized by 
the parameter or the family of parameters a. The neces- 
sary and sufficient condition for nonvanishing asymptotic 
acceptance probabilities is the existence of values ad such 
that 

Di/ 2 (d,a d ) < M < oo, Vd>l. (40) 

In words, the sequence D 1 / 2 (d,ad) must be bounded. If 
D 1 / 2 (d, ad) has a subsequence increasing to infinity, then 
the corresponding acceptance probabilities will converge 
to zero, by the upper-bound inequality in Eq. (|2|). This 
establishes the necessity. Conversely, if D 1 / 2 (d,ad) is 
bounded from above by M, then the acceptance prob- 
abilities cannot fall below e~ M /2, by the lower-bound 
inequality. 

Albeit the index d does not have to be the dimension, 
the dependence with the dimensionality is the natural 
problem to study, owing to the additivity of the entropy 
divergence for independent systems and trial distribu- 
tions. The necessity for tuning the parameters a comes 
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from an expected unlimited increase in the entropic di- 
vergence if these parameters are kept constant. For defi- 
niteness, assume that we are given a chemically homoge- 
neous system. For proposals T(y, x) that are products of 
dimension-independent distributions, we expect the en- 
tropic divergence to increase linearly with the number of 
dimensions. Indeed, the proposal is already conditionally 
independent, whereas the particles making up a physical 
system must decorrelate in the thermodynamic limit. A 
better insight is provided by the Kullback-Leibler diver- 
gence, which, according to Eq. (Til?|) , bounds the Renyi 
divergence of order 1/2 from above. The behavior of 
D\ closely mimics the behavior of the Boltzmann-Gibbs 
entropy except for the case where the proposal distribu- 
tion introduces additional correlation. (Recall that the 
Boltzmann-Gibbs entropy and the Shannon entropy are 
given by the same formula, except for the multiplicative 
Boltzmann constant). Moreover, D\ is easier to evaluate 
numerically if verification of linearity is desired in order 
to test for finite-size effects. 

In the reminder of the section, we apply the condi- 
tion to the random-walk Metropolis algorithm and the 
parallel tempering method in order to demonstrate the 
versatility of the entropic inequalities in predicting the 
scaling behavior in various situations. The random-walk 
Metropolis algorithm is constructed from a proposal dis- 
tribution of the form 



T(x + z|x) = a^fia^z). 



(41) 



Here, /(z) is a normalized distribution invariant under 
the transformation z i— ► — z. For simplicity, we take /(z) 
to be symmetric in each of the variables Zi. In addition, 
we require that f(z) be a product of low-dimensional 
distributions, so that the entropic divergence does not 
increase faster than linearly. Typically, the scaling factor 
a > is tuned such that the acceptance probability lies 
somewhere between 20% and 50%. The entropic diver- 
gence Di/ 2 (d, a) is 



2 log a -1 / dx / dzf{a- 1 z) [p(x)p(x + z)] 



21og( /dx/dz[p(x)p(x + az)] 1/2 /(z) 



1/2 



For given d, the entropic divergence converges to zero as 
a — > 0. In the same limit, the trial distribution shrinks 
to a delta function. 

For small a, the leading term in the entropic diver- 
gence, as follows from a Taylor expansion around a = 0, 
is the Fisher entropy. More precisely, we have 



0p(x) 



dxi 



dx + 0(a 4 d), 
(42) 



where 



f(z)zfdz, l<i<d 



are the second-order moments of the kernel /(z). Un- 
der the assumptions we made about /(z), the first-order 
moments and the cross second-order moments are zero. 
The error has an implied linear-only dependence with 
the dimension owing to the assumed decorrelation of the 
particles in the thermodynamic limit. Eq. (|40p implies 
that ad must decrease as fast as c? -1 / 2 (more generally, 
as the inverse square root of the Fisher entropy). Since 
ad<Ji is the standard deviation in the direction i of the 
trial distribution, we see that the average displacements 
are proportional to the inverse square root of the num- 
ber of dimensions sampled together. That the scaling is 
correct has been verified numerically for Lcnnard- Jones 
clusters [IH and found to hold true. In addition, the 
mathematical analysis performed by Roberts et al flBI ] 
for independent systems leads to the same conclusion. 

The finitude requirement for the second-order mo- 
ments of the function f(z) sets some constraints on the 
length of the tails for proposal rules that are products 
of low-dimensional distributions. Normally, we would 
like to utilize such products in order to avoid introduc- 
ing artificial correlation between distant particles. Also, 
functions /(z) with long tails seem more adequate for 
potentials with rough topologies. A good compromise is 
provided by products of coordinate-scaled versions of the 
one-dimensional function 



/(z) = (i + z 2 r 3/2 /2 



(43) 



Albeit f(z) does not have a finite variance, its second mo- 
ment is only mildly divergent and the distribution seems 
appropriate up to a few tens of variables updated simul- 
taneously. The advantage over a finite-variance function 
of the form (1 + z 2 )-3/ 2 - e jg that random numbers dis- 
tributed according to f(z) can be cheaply generated by 
means of the identity £ = (f - l/2)/(f(l - £) 1/2 ), with 
£ uniformly distributed on (0, 1). The trial distribution 
just described appeared to be superior to the standard 
flat distribution in the Monte Carlo sampling of a path- 
integral implementation of the Onsager-Machlup formula 
[HI]. Nevertheless, any clear advantage of a long-tail dis- 
tribution is perhaps limited to rough potential surfaces. 

The second example we work out is for parallel tem- 
pering 0, dH , where swaps are attempted between two 
large dimensional systems characterized by the inverse 
temperatures j3' > (3 [l9j]. The entropic divergence is 
given by 



D 1/2 {d,R) = -2\og 



Q{P)Q{P) 



(44) 



where (3 = (f3 + /3')/2 and R = f3'/(3 > 1. Eqs. 10 and 20 
of Ref. HI lead to 



R—l 



D 1/2 (d,R) = 2 [j^j C { v d) ((3) + 0(d\R - 1| 3 ), 

where Cy\(3) is the potential contribution to the heat 
capacity of the physical system. 
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Again, we can compensate the linear increase charac- 
teristic of the heat capacity by letting the ratio R get 
close to 1 sufficiently fast. We must choose a dimension- 
dependent value Rd such that 



C { v d) ((3)(R d -lf <M<oo, 



(45) 



The last inequality predicts that the temperatures (3 and 
f3' must get close one to another as fast as dr 1 / 2 , a con- 
dition that is also sufficient. That the heat capacity is 
the main cause for the decrease in the temperature ra- 
tios has been demonstrated before by Kofke [18j |. The 
dT 1 / 2 scaling has been confirmed by numerous simula- 
tions [l9l. l24l and also subjected to specific mathematical 
analysis [l8l.[l9l|. 



V. ANALYSIS OF THE SMART MONTE 
CARLO METHOD 

The force-biased Monte Carlo methods [25[ attempt to 
improve their efficiency by sampling from a more suitable 
local distribution constructed with the help of the first- 
order derivatives of the potential. In one-dimensional 
notation, we have 

e -PV(y) = e -pv{ X )-pV'{ X ){y- X ) + Q y _ x fy (4g) 

The right-hand side must be multiplied by a windowing 
function that eventually determines the lengthscale on 
which the Taylor approximation is valid. The windowing 
function must also decay fast enough to infinity, so that 
to make the right-hand side integrable. It was originally 
taken to be constant on a symmetric interval centered 
about x. If, for ease of computation, it is taken to be a 
Gaussian, then 



-PV(y) 



-f}V(x)-f3V'(x)(y-x)-(y-x) 2 /2* 2 + Q n y _ x >2y 



Completing the square, we deduce that a suitable pro- 
posal distribution is 



T(y\x) 



1 f [y-x + 0a*V'(x)Y \ 

ex P s — f ■ ( 47 ) 



v / 2vr ( T 2 



2<T 



We notice that the trial move is the transition kernel for a 
Smoluchowski process (or generalized Brownian motion) , 
and this is clearly true if we only care to set a 2 = 2Dt, 
with D being the diffusion constant and t being the 
(short) transit time. Nevertheless, the equilibrium tem- 
perature is wrong, namely exp[— 2(3V(x)]. 

The Smart Monte Carlo method was introduced by 
Rossky, Doll, and Friedman [2(| not as a force-biased 
technique, but as a Metropolis-Hastings correction to a 
Smoluchowski process. The idea is that the latter process 
is already an ergodic sampler for the Boltzmann distri- 
bution. However, the short-time transition kernels that 
are numerically available are only asymptotically accu- 
rate. The transition kernel described by Eq. (|4"T|) . with 



the inverse temperature (3 replaced by (3/2, is character- 
istic of the stochastic Euler's integrator. Rao and Berne 
have subsequently modified the force-biased method 
and introduced a tuning parameter 9 to control the force 
bias. 

We follow their approach and define a class of transi- 
tion kernels depending on a parameter 9 and having the 
form 



1 



To(y|x) =n^H- 



[Vi 



9l3a 2 V t V{x)Y 



2aj 



,( 48 ) 

Notice that for multidimensional systems the transition 
kernel is a product of one-dimensional distributions. As 
discussed in the preceding section, we expect the en- 
tropic divergence to behave linearly in the thermody- 
namic limit, because the proposal does not introduce 
additional particle correlation over the Boltzmann dis- 
tribution. Let us find the value of 9 that maximizes the 
standard deviations cr; for large dimensional systems. In 
Appendix III, we show that the relative entropy D 1 / 2 (9) 
is given by the formula 



D 1/2 (9) =f3 2 {9- 1/2) 2 (J2 hViV(x)f| + 0(d|M| 4 ). 

(49) 

The brackets denote an average against the Boltzmann 
distribution and the quantity itself is again the Fisher 
entropy (of course, up to some scaling or multiplicative 
constants). 

Assume 9 =/= 1/2. From the asymptotic requirement 



(3 2 (9 - 1/2) 2 ( [viViV(x)] 2 )<M< 00, 



we learn that the maximal standard deviations scale as 



(50) 



if the acceptance probability is to converge to a non- 
vanishing value in the limit d — > 00. Even more, the 
dominant term in Eq. (|49[) has the same value whether 
9 = or 9 = 1. It follows that, for sufficiently large 
systems, the original force-biased Monte Carlo method 
we considered in Eq. 1|47[) exhibits essentially the same 
average displacements as the unbiased method. 

The Smart Monte Carlo method corresponds to the 
choice of parameter 9 — 1/2, which is the only value 
that cancels the dominant Fisher entropy. As argued in 
Appendix III, for 9 = 1/2, the decay of the entropic di- 
vergence is as fast as d|jcr|j 6 , but in general not faster. 
Again, the factor d is always linear, albeit it needs not 
be so for systems that do not have a well-defined ther- 
modynamic limit. The scaling of the standard deviations 
with the dimensionality improves to 



&i,d ~ Pi /a ' , 



(51) 
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where of are some asymptotic constants. This result 
has been obtained before by Roberts and Rosenthal [l6| 
for independent systems and suggested by the work of 
Kennedy and Pendleton [l3| on hybrid Monte Carlo. 

Incidently, the asymptotic expansion worked out in 
Appendix III shows that the cancellation of the term 
||er|| 4 is due to the special values of the moments of the 
Gaussian window. The force-biased method can also 
achieve the dr 1 ^ scaling provided that the original rect- 
angular window is modified so that to match the first 
5 moments of a Gaussian. The simplest replacement is 
furnished by scaled products of 



(27T) 



-l/2 g -z 2 /2 



-L 



V5.V5] 



(z). 



(52) 



The first term is Dirac's delta function, whereas the sec- 
ond is the indicator function for the interval [— yo, v5]. 
The utilization of the delta function does not pose pro- 
gramming problems. By its symmetry, the window is 
only part of the proposal step and conveniently cancels in 
the acceptance/rejection step, save for the normalization 
coefficient. A compactly supported window may allevi- 
ate some of the numerical issues associated with Euler's 
integrator, as the particles are prevented from moving 
arbitrarily far under harsh gradients. 

In the remainder of the section, as an illustration, 
we determine numerically the scaling of the tuned stan- 
dard deviations <7j j( j for Lennard-Jones clusters. As well- 
known, the potential energy is given by 



N p N p 

V tot (x) = ^V LJ (r ii )+£v c (r i ), 



i<j 



(53) 



with N p being the number of particles. As usual, Vlj ) 
is the Lennard-Jones potential describing the interaction 
between the particles % and j 



V LJ (rv 



4e LJ 



\ '13 



(54) 



Vc(i"i) is a confining potential of the form 



0, 

5-10 3 e w (||ri 



if 



Rr 



Rr 



\/R c ) , otherwise. 



(55) 

Thus, the cluster is confined to its center of mass R cm - 
The values of the Lennard-Jones parameters &lj and clj 
used are 2.749 A and 35.6 K. They are characteristic of 
the Ne atoms. 

The confining potential has been utilized before in 
Ref. [23[ for diffusion processes and found to be more 
adequate than a steeply-increasing high-order polyno- 
mial. The latter introduces numerical instabilities owing 
to the high gradients experienced by the atoms reach- 
ing the frontier. To overcome the tendency of SMC to 
stall if high gradients are met, we have coupled the sam- 
pler with a standard random-walk Metropolis algorithm. 
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FIG. 1: Scaling of the standard deviations with the number 
of particles n updated simultaneously for the LJgi cluster at 
different low temperatures. A confining radius of Agl.j is 
utilized. The updated particles are chosen randomly. The 
observed n -1//4 behavior is caused by surface effects, which 
induce an artificial increase in the standard deviations for 
small n. The scaling is expected to improve to n _1,/6 , in the 
thermodynamic limit. 



The latter has been randomly utilized 25% of the time. 
The coupling with a random-walk Metropolis algorithm 
also serves a second purpose. Christensen et al [17| have 
pointed out that SMC is slow to attain stationarity if 
run by itself and that an n -1 / 4 scaling is more like to be 
observed in the transient regime. 

The typical strategy is to simulate a cluster having a 
large number of particles N p , while utilizing updates of 
the whole system. In the second phase of the simulation, 
the underlying all-particle sampler is kept running. A 
smaller number of particles n are attempted to be up- 
dated and the acceptance probability for such moves is 
accumulated by means of the estimator discussed in Sec- 
tion III. Because the standard deviations for the proposal 
distribution keep changing, the attempted moves are not 
accepted, as this would alter the detailed balance. The 
computed acceptance probabilities are utilized to tune 
the standard deviations for the hypothetical n-particle 
sampler so that the final acceptance probability lies be- 
tween 39% and 41%. Since the moves are not accepted, 
we can choose those n particles that are the closest to 
the center of mass. The coating provided by the exterior 
atoms minimizes the surface effects. Otherwise, if the n 
particles are chosen randomly, then the standard devia- 
tions for small n are artificially increased, owing to the 
larger mobility of the atoms at the surface. As shown 
in Fig. [TJ the surface effects create the appearance of an 
yi -1 / 4 scaling. 

Fig. [3 shows the common standard deviation of the 
trial distribution for different numbers n of embedded 
particles that are updated simultaneously and for differ- 
ent values of the parameter 9. As n increases, the value 
6=1/2 becomes the optimal one, in agreement with the 
theoretical predictions. For n = 51 and n = 91, it is also 
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FIG. 2: Dependence of the standard deviations with the pa- 
rameter 9 for various numbers of particles updated simulta- 
neously. The updated particles are those closest to the center 
of mass of a 500-atom cluster. A confining radius of 5.5cflj 
has been utilized. The ^-dependence expressed by Eq. (I49p 
implies that the plotted curves become symmetric in the ther- 
modynamic limit and feature a \6 — 1 / 2 1 1 singularity. 
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FIG. 3: Dependence of the observed acceptance probabilities 
with the number of particles n updated simultaneously, for 
standard deviations scaled according to the n _1,/6 law. As for 
Fig. [5] the updated particles are those closest to the center of 
mass of a 500-atom cluster. In both cases, the Lennard- Jones 
potentials have been cut off beyond 3o\lj. The error bars are 
given by the second significant digit. 



apparent that the standard deviations for the unbiased 
(6 = 0) and fully biased (6 = 1) Monte Carlo techniques 
are almost equal, again, in agreement with the theory. 
For small n, the plots are biased toward larger 6, which 
suggests that the guidance provided by the force bias is 
energetically favorable, yet eventually hindered by the 
entropic effects. 

Despite significant computational effort, we were not 
able to evaluate the standard deviations with sufficient 
accuracy for a proper demonstration of the n -1 / 6 scaling. 
The reason is as follows. Even if we increase the num- 
ber of particles updated simultaneously in an exponential 
fashion, say n = 2 k , the standard deviations for succes- 
sive n differ by a meager 12.2%. To correctly pinpoint 
the scaling, the relative error in the standard deviations 
needs to be about 2%. This is difficult to achieve by 
tuning. Many blocks are necessary, and the blocks must 
contain sufficiently many steps that the block averages 
for the acceptance probabilities have adequate statistical 
errors. However, a different approach is equally valid and 
computationally less expensive. We optimize the scaling 
parameters for the largest number of particles we plan 
to update (here, n — 91), and then decrease the num- 
ber of atoms, say by 10. For n = 81, the theory says 
that the acceptance probability remains constant if the 
standard deviations for n = 91 are increased by a factor 
of [(n + 10) /n] 1 / 6 . The procedure is repeated down to 
n = 1. To confirm the scaling, it suffices to check that the 
acceptance probabilities do remain constant. Since the 
standard deviations are kept constant for a given n, the 
many block estimates utilized for tuning can now be av- 
eraged to produce accurate estimates for the acceptance 
probabilities. The results presented in Fig. [3] adequately 
confirm the theoretical scaling. 

We have utilized the Smart Monte Carlo method to 



sample the LJ38 cluster, which is a notoriously diffi- 
cult problem, especially when a large confining radius 
R c = 3.612ctlj is utilized. We have implemented SMC 
in the all-particle version and run it for a number of 1.25 
billion steps. The number of steps has been chosen so 
that the computational cost is the same as for the simu- 
lation performed in Ref. [2?]. There, a standard Metropo- 
lis sampler has been run for 360 million sweeps, each 
sweep being composed of 38 single-particle Metropolis 
moves. Even though parallel tempering is utilized, the 
LJ38 cluster exhibits excessively long equilibration times. 
Heat capacities are overly sensitive to the lack of equili- 
bration and serve as a good indicator for the progress of 
the simulation. The six curves shown in Fig.[3]have been 
evaluated by accumulating data over 175 million SMC 
steps. They follow the same history as the curves pre- 
sented in Fig. 1 of Ref. 27. Comparison shows that the 
Smart Monte Carlo method is marginally faster. There- 
fore, the simulation demonstrates the capability of the 
SMC method to sample difficult systems in the range of 
tens of atoms. For more complicated potential energies, 
the evaluation of the forces may be significantly faster 
than the evaluation of the potential differences for single 
particle moves, making the SMC method more advanta- 
geous (a non-trivial advantage is also the simplicity of 
the ensuing codes and the parallclization opportunity of- 
fered by the force-field). A more elaborate discussion is 
provided in the following section. 



VI. CONCLUSIONS 

The present study allows us to make several general 
recommendations on the design of algorithms. We be- 
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FIG. 4: The evolution of the heat capacity profiles is similar 
to that from Fig. 1 of Ref.[27|. Each of the six curves has been 
obtained by collecting averages for groups of 175 million steps. 
Early in the simulation, the heat capacity develops a fake 
maximum, which is subsequently transformed in a "shoulder." 
The last four curves fluctuate around this shoulder, with the 
fluctuations providing an estimate for the errors. 

gin by considering the case of a single-particle sampling 
strategy. The approach can be regarded as a Monte Carlo 
implementation of a Gibbs sampler ■ The latter is an 
idealized technique whereby the new coordinates for a 
particle are proposed directly from the conditional distri- 
bution of the particle. Obviously, a Gibbs sampler has a 
built-in correlation that sets a limit on the performance 
of Metropolis algorithms that update only a few vari- 
ables at a time. As such, we cannot indefinitely improve 
the efficiency of the algorithms by increasing the average 
displacements for the trial distribution. For example, in 
the case of a Lennard- Jones cluster, a particle essentially 
moves in a cage created by its environment. Even if we 
sample independently from the conditional distribution 
of the particle, we still propose moves mostly in this cage, 
which is the statistically important region. 

Coping with the Gibbs correlation requires updating 
more than one particle at a time. To set things in 
perspective, in the limit that all particles are updated 
and that the displacements are comparable with the size 
of the cluster, the sampling becomes almost indepen- 
dent. Unfortunately, by comparison with the Monte 
Carlo Gibbs-like sampler described in the preceding para- 
graph, the displacements are further decreased by the 
cntropic effects. In the limit of small displacements, 
the Metropolis sampler is essentially diffusive. Conse- 
quently, if the decrease in the displacements is n -Q , we 
will roughly need n 2a steps to cover distances comparable 
to those for a Gibbs-like sampler. 

Despite its intrinsically local nature, an all-particle 
Metropolis sampler can be more efficient than a Gibbs 
sampler if the paths followed by the latter involve en- 
ergetic barriers that are significantly larger than those 
characteristic of paths close to the minimum energy path. 



The latter paths are available to an all-atom sampler and 
more so to the Smart Monte Carlo method, whereas a 
Gibbs sampler may have little choice in effectuating a 
rare transition other than placing the particles succes- 
sively on the edges of the conditional cages. In case of 
high cooperativeness for the minimum energy path [281 ] , a 
large number of particles must be correctly positioned in 
order to cause a successful transition. Unfortunately, the 
transition probability is exponentially decreasing with 
the aforementioned number of particles, because prob- 
abilities for individual particles multiply. 

Which strategy must be utilized depends also on the 
computational cost to implement a given step. For gen- 
eral potentials, there might be the case that updating a 
single particle costs the same as updating all particles. If 
this is true, then it seems more advantageous to utilize 
the all-particle Metropolis algorithm. Indeed, we need 
n calls to the potential function to defeat the entropic 
effects and span reasonable distances. The Gibbs-like 
sampler makes n calls to the potential function to com- 
plete a sweep. However, the first technique is in principle 
not limited by the formation of cages. Nevertheless, in 
practice, the situation is more complicated because up- 
dating a single particle rarely requires a full evaluation 
of the potential. For example, a Lennard-Jones system 
can be updated one particle at a time in such a way that 
the cost for a complete sweep is twice the cost for an 
all-particle move. The gain in computational efficiency is 
large enough that the Monte Carlo implementation of the 
Gibbs sampler is the default sampler for Lennard-Jones 
clusters or liquids. Owing to its superiority in dealing 
with the entropic effects, the Smart Monte Carlo method 
is a good candidate for sampling subsystems containing 
tens or hundreds of atoms. 

To summarize, the ideal sampling strategy would di- 
vide the system in small subsystems (which may overlap) 
in a way that optimally compromises on the following re- 
quirements: i) the correlation between the subsystems 
be as small as possible and ii) the dimensionality of each 
subsystem be as small as possible. An efficient implemen- 
tation may require the utilization of variables other than 
Cartesian ones, as the subsystems may become nearly un- 
corrected in such coordinates. Each subsystem, which 
is described by strongly correlated degrees of freedom, 
is sampled utilizing an all-particle strategy or the Smart 
Monte Carlo method, if forces are easily computable. Fi- 
nally, if techniques of reducing the dimensionality of the 
system are not available, then the designer should intro- 
duce tuning parameters that can arbitrarily increase the 
similarity between the direct and reverse sampling dis- 
tributions. A case is constituted by the replica-exchange 
techniques. If tuning parameters are not provided, then 
one unavoidably ends up with an algorithm that sud- 
denly stops working for systems of sufficiently large di- 
mensionality, owing to the severe exponential decay in 
the acceptance probability. 

The present study seems to favor coarse-graining and 
constrained-sampling as techniques capable of dealing 
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with the negative impact of the dimensionality on the 
maximal displacements. Such approaches reduce the ef- 
fective dimensionality to that of the space of coarse vari- 
ables, ideally described by their marginal distributions, 
or to that of the manifold defined by the set of con- 
straints. Resolution exchange [2!| can restore exactness 
in the first case, while at the same time improve the 
rare-event statistics, owing to the superior computational 
performance for the coarser levels of resolution. In this 
context, it appears that the estimators for the average 
acceptance probabilities derived here may prove useful 
in determining with reliability the degree of overlap be- 
tween replicas of different resolutions. 

Finally, on a separate topic, the influence of dimen- 
sionality on the global properties of molecular dynam- 
ics integrators is an issue that has to be more carefully 
investigated. As Kennedy and Pendleton have shown, 
Hybrid Monte Carlo, albeit guaranteeing exactness by 
means of detailed balance, requires time steps that de- 
crease as a fractional power with the dimensionality of 
the system. The cause is precisely the entropic diver- 
gence documented here, this time between the phase- 
space Boltzmann distributions of the true and shadow 
Hamiltonians. Thinking that systems of 10 6 degrees of 
freedom are typical nowadays, a decrease in the time step 
by a factor of 10 is not trivial. However, more worrisome 
seems to be the reverse scenario, where Hybrid Monte 
Carlo is not utilized. It is not at all clear whether or not 
the extensive entropic divergence may lead to differences 
between the exact and the integrator's statistics of such 
a nature that the validity of the latter as an approxima- 
tion becomes questionable. Such a scenario could hap- 
pen if the energy error is not equally divided between 
the degrees of freedom, and instead is transferred in an 
unfavorable way to crucial low-dimensional subsystems. 



Acknowledgments 

This work was supported in part by the National Sci- 
ence Foundation Grant No. CHE-0345280 and by the 
Director, Office of Science, Office of Basic Energy Sci- 
ences, Chemical Sciences, Geosciences, and Biosciences 
Division, U.S. Department of Energy under Contract No. 
DE-AC02-05CH11231. NERSC has provided the com- 
puting resources. 



APPENDIX A: LOW- VARIANCE ESTIMATORS 
FOR STRONGLY ASYMMETRIC 
DISTRIBUTIONS 

Sometimes, a sampling problem appears naturally in 
the form 



with /(x, y) symmetric and 7r(x, y) positive. The dis- 
tribution at the denominator is assumed unnormal- 
ized for the sake of generality. The default estimator 
/(x, y)/7r(x, y) might register large contributions to the 
variance arising from the pairs (x, y) where 7r(x, y) is 
small. This is apparent from the formula for the second 
moment, which is 



J/dxdy/(x,y) 2 A(x,y) 
/ / dxrfy7r(x,y) 



(A2) 



Nevertheless, for strongly asymmetric sampling distribu- 
tions, 7r(y,x) might be large for such pairs. It turns 
out that it is possible to construct estimators for which 
the variance behaves as if the sampling distribution were 
replaced by max{7r(x, y), 7r(y, x)} and which incur little 
penalty over the default estimator if the sampling distri- 
bution is not strongly asymmetric. 
Recall 

X(x,y) = log[7r(y,x)/7r(x,y)] (A3) 
and let e > 0. Define the sets 

A e = {(x,y) :X(x,y)<-e} (A4) 

and 

B e = {(x,y):|X(x,y)|<e}. (A5) 

Notice that 

I A c (x, y) = I Ar (y, x) + I Be (x, y), (A6) 

a property that stems from the antisymmetry relation 
X(x, y) = — A"(y,x). Therefore, we can write 

J J dxdyf(x,y) = J J dxdyf(x, y)I Ac (x, y) 

+ J J dxdyf(x,y)I A ,(y,x) (A7) 

+ J J dxdyf(x,y)I Bc (x,y). 

The symmetry of /(x, y) implies that the first two terms 
of the right-hand side are equal, so that 



dxdy/(x,y)=2 J J dxdyf(x, y)I Am (x, y) 

+ j J dxrfy/(x,y)7 Be (x,y).(A8) 
This leads to the estimating function 



^T^xl 1, if|X(x,y)| <e, 
-( x <y) 1 2, ifX(x,y)<-e. 



(A9) 



f f dxdyf(x,y) 
f J dxdyn(x,y)' 



(Al) 



The estimator given by Eq. (|3"5|) is obtained by set- 
ting 7r(x, y) equal to p(x)T(y|x) and /(x, y) equal to 
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min{7r(x, y), 7r(y, x)}. The partition function estimator 
given by Eq. (12) of Ref. is obtained by setting 7r(x, y) 
equal to ^(x)^ (y) and /(x,y) equal to p j (x)p j (y). 

In order to avoid a divergent variance caused by small 
values of 7r(x, y), it suffices to let e be smaller or equal to 
log(2). Indeed, on the set |A(x, y)| < e, the inequality 



7r(y,x) < e e 7r(x,y) < 27r(x,y) 



implies 



27r(x, y) > max{7r(x, y), 7r(y, x)} . 

On the other hand, on the set X(x, y) < — e, 

7r(x,y) = max{7r(x,y),7r(y,x)} . 

Therefore, the square of the estimating function is 
smaller than 



2/(x,y) 



0, ifX(x,y)>e, 
, x <J 1, if |X(x,y)| <e, 

7r(x,y)max{7r(x,y),7r(y,x)} 1 2 ifx(x,y)<-e, 

(A10) 

the expected value of which is 



J f dxdy/(x,y) 2 /max{7r(x,y),7r(y,x)} 
f f dxdyir(x,y) 



(All) 



by the symmetry of the integrand in the numerator. The 
value of Eq. (|A11[) can be at most twice larger than the 
second moment of the default estimator. However, it may 
actually be significantly smaller for strongly asymmetric 
distributions, for which max{7r(x, y), 7r(y, x)} may often 
be substantially larger than 7r(x, y). Notice that Eq. 
is still valid and implies P(X < 0) > 1/2. Thus, the 
penalty of at most 2 has a statistical origin. For nearly 
symmetric distributions, we may end up utilizing only 
half (but not less) of the total number of points sampled. 

Among the possible values of e < log(2), the choice 
e = log(2) has one additional desirable property. If the 
ratio /(x, y)/7r(x, y) is bounded everywhere by a con- 
stant M > 0, the modified estimator should ideally ex- 
hibit this property as well. An example is provided by the 
modified estimator for the acceptance probability, which, 
desirably, should be less or equal to 1 everywhere. The 
choice e = log(2) guarantees the boundness property. 
The proof is as follows. The modified estimator can only 
be out of bounds on the set X(x, y) < — log(2), possibly 
because of the multiplicative factor of 2. However, on 
this set, the inequality 7r(y,x) < 7r(x,y)/2 holds. The 
boundness by M then follows from the relations 

2 l/(*,y)| = J/(y,x)| < |/(y,x)| < M 
7r(x,y) 7r(x,y) 7r(y,x) 



APPENDIX B: PROOF OF EQ. ([37]) BY 
CRAMER'S THEOREM 

As otherwise the statement of Eq. (1571) is trivially true, 
we shall assume that the set of points for which the de- 



tailed balance condition is violated 

p(x)T(y|x) ? p(y)T(x|y) 



(Bl) 



has nonvanishing probability. The rejection mechanism 
in the Metropolis-Hastings method has the purpose of 
correcting the lack of detailed balance of the trial dis- 
tribution. Owing to Eq. I|B1[) . we have D\ > 0. Recall 
Eq. |8]) and consider the moment generating function 



0(0)=Eexp(0X) = J J 7r(x,y) 
Observe that the first derivative 



7r(y,x) 
7i"(x,y) 



//"fry) 



7r(y,x) 



7r(x,y) 



log 



7r(y,x) 
7r(x,y) 



dxdy. 
(B2) 

dxdy 



is defined on the interval [0,1]. In fact (f>'(0) = — D±, 
0'(1) = D u and 0'(l/2) = 0. The last relation follows 
easily by symmetry. If a is such that —D\ < a < Di, 
then the equation a — (p'(6 a )/(p(6 a ) has a unique solution 
in the interval (0, 1), owing to the fact that <f)' (0) / (f)(6) is 
increasing. The reader can look at the moment generat- 
ing function and see that the derivative of <f>' (6) / (f)(6) is a 
"heat capacity." Moreover, Eq. (|B1|) demands that this 
heat capacity be continuous and non-zero on (— D±,Di). 
From the implicit mapping theorem, it follows that the 
solution 6 a is continuous in a. Of note is that 6q = 1/2. 

According to Cramer's theorem, which is stated at the 
end of the appendix, we have 



lim — log P(S n /n > a) 

n — >oo fi 



-a6 a + log (f>(6 a ), 



for all a e (-Di,Di). If a > 0, then, by Eq. (J3BJ), 

2P(S n /n >0)>A n > 2P(S n /n > a), 



from where it follows that 



log 0(1/2) > lim sup — log A n 

n — >oo Tl 



> lim inf — log A n > 



- dd a + \0g (f>(6 a ). 



Letting a — > 0, by means of the aforementioned continu- 
ity, we obtain the desired result, namely 

lim -log A n = log 0(1/2) = -\d 1/2 < 0. (B3) 

To see that Z?i/2 > 0, observe that the function 
f(x) — x 1 / 2 is concave. Jensen's inequality as applied to 
Eq. (|B2|) produces 0(1/2) < 1, with equality if and only 
if 7r(x, y) = 7r(y, x) holds almost surely, which is contrary 
to our assumption. Nevertheless, if the last equality holds 
almost surely, then Eq. (|37p is trivially true, since A n = 1 



and D 1 



/2 



0. 



To help the reader understand the proof, we give the 
statement of Cramer's large deviation theorem [2l| . 
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Theorem 1 Let {X n ;n > 1} be a sequence of indepen- 
dent and identically distributed random variables and let 
S n = Xi be the partial sums. Let /i — EAi and 

pick a > fi. Suppose 4>(9) = Ecxp(O^) is defined on 
the interval [0,0+) and that the distribution of Xi is not 
a point mass. Assume also that there is 9 a £ (0,9+) so 
that a = <f>' {9 a) / <f>{9 a) ■ Then as n — » oo, 

log P(S„ > na) -> -a9 a + log (j>{6 a ). 



APPENDIX C: PROOF OF EQ. (JUD 

We first consider the case d = 1. Write D 1 / 2 (9) in the 
form Dx/ 2 (9) = —2\ogS(9). Following the substitution 
y = x + az, S(9) is given by the expression 



S{9) 



dx / dze -P\V(*)+V( X +za)}/2 



xe -[z+ef3aV'(x)} 2 /4 e -[z-ei3aV'(x+za)} 2 /4^ 



By interchanging the order of integration and performing 
the substitution x' = x + za/2, S(9) is seen to equal 



1 



dx 



dze -§lV(x-za/2) + V(x+za/2)] 



xe -[z+00aV (x-za/2)} 2 /4 e -[z-80aV (x+za/2)} 2 /4 ^ ^) 

The integrand in this equation is symmetric with respect 
to the variable z. Rearrangement of the exponent pro- 
duces 



S{9) 



dxe- pv{x) -^— 
Q(P)J m Xe 



xe 



xe 



xe 



dze- z2 / 2 



-@[V(x-za/2)+V{x+z<7/2)-2V(x)}/2 
9/3za[V' {x+zo /2)-V' (x-za/2)]/2 

2 2 a 2 [V'(x-za/2) 2 +V'(x+za/2) 2 ]/4 



(C3) 



The reason we write S(9) in this form is that a appears 
either in the form a 2 or as a product za. Since the odd 
moments of the Gaussian measure vanish, it follows that 
the Taylor expansion at a — contains only terms of the 
form o~ 2k . Evaluating the first two nonvanishing terms 
explicitly, we obtain as an intermediate result 



S(9) = 1 - 



1 



Q(0) 



tee-™ 



2 /?V 



V'{xf 



+ 0{a A ). (C4) 



Integration by parts produces 

dxe- pv[x) V"{x) = I dxe- pv ^(3V'{xf. 



Eq. l[C4)) reduces to 

8(0) = 1 - 02 ° 3 ^] /2)2 I dxe-^V'(x) 2 + 0(a% 

(C5) 



Since D 1/2 (9) = -2\ogS{9), we obtain 

D 1/2 (9) = (3 2 a 2 {9 - 1/2) 2 {V'{x) 2 ) p + O^ 4 ), (C6) 

where we have utilized the bra-ket notation to denote 
thermodynamic averages for the Fisher entropy. For the 
e?-dimensional case, the reader can reduce the problem to 
the form specified by Eq. (|C3[) . perform the expansion, 
and then notice that all averages involving cross terms 
ZiZj with i j vanish. One eventually obtains the form 
specified by Eq. |49)) . 

For 9 — 1/2, we can actually show that the decay of 
the entropy divergence is faster than the quartic rate im- 
plied by Eq. (|C6|l. The decay is as fast as 0(||a|j 6 ), but 
in general not faster. Expensive yet straightforward cal- 
culations show that the quartic term is given by the ther- 
modynamic average of the sum over 1 < i, j ' < d of the 
terms 



64 



{3[3d Ujj V{x) +(3 2 [-2%y(x)%F(x) 



+d ii V{x)d jj V{x) - 45 < y(x)% i y(x)] (C7) 
-2/3 3 [a,V(x)] 2 ^T/(x) +f3 4 [d l V( X )] 2 [d J V( X )} 2 } . 

Again, we utilize integration by parts to replace the in- 
tegrands with equivalent ones. The goal is to remove the 
integrands containing high order derivatives. By means 
of the identities 

{dujjVWp = (diV^dijjV^))^ 

= [3 2 <[fiiV(x)] a %V(x)) - (3 (duVWdjjVWp , 



we can replace Eq. (|C7[) with 
a 2 a 2 

-jr± {2f3 2 d lJ V( X )d lJ V( X ) - 2/3 2 d u V(x)d J1 V(x) 
+3{1 3 [d i V(x)] 2 d jj V(x.) - /3 4 [a^(x)] 2 [9^(x)] 2 } . 
Furthermore, the equality 

([d t V(x)] 2 d, 3 V(x)) p = (3{[d i V{x)?[d j V{x)?) p 
-2(9^(x)%l/(x)a,V r (x)) /3 
leads to the equivalent integrand 



32 



{/? 2 %V(x)%^(x) - f3 2 d ll V( X )d JJ V( X ) 



+l3 s [d l V{*)] 2 d 0J V{*) - /? 3 cW(x)c^(x)c^(x)} . 

Letting cf>(x) — exp[—[3V(x)/2), it is straightforward 
to show that the term of order ||cr|| is the sum over all 
1 < h 3 < d °f the terms 



8Q(f3) 



{0«0(x)a^(x) - [%</>(x)] 2 } dx. (C8) 



However, the terms are equal to zero because of the iden- 
tities 

9, i 0(x)9 JJ -0(x)(ix = / <j>(x)dujj<j)(x)dx 



dij^(x)dij4>(x)dx, 
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which follow by integration by parts. 

We conclude that, if = 1/2, the dominant term in 
the expansion of the divergence Di/ 2 (9) in powers of a 
is 0(||cr|| 6 ). For the quadratic potential V(x) = 2x 2 and 



for 0=1, numerical results show that D 1 /2(&)/<J 6 — > 1, 
as (7 — > 0. Thus, in general, we cannot hope for a better 
scaling. 
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